GitHub - Data Extraction

The file ../data/github-repositories-2015-02-17 contains a list of GitHub repositories that are candidates for hosting an R package. Those candidates were collected from activity on GitHub between 2013 and 2014 (inclusive). Each of them contains a DESCRIPTION file at the root of the repository.

We git-cloned each of those repositories. This notebook parses those local git repositories and extracts the DESCRIPTION file from each commit.


In [1]:
import pandas
from datetime import date

We will make use of the following commands:

  • git clone <url> <path>, where <url> is the URL of the repository and <path> is the location where the repository will be stored.
  • git log --follow --format="%H/%ci" <path>, where <path> will be DESCRIPTION. The output of this command is a list of <commit>/<date> lines for this file.
  • git show <commit>:<path>, where <commit> is the considered commit and <path> will be DESCRIPTION. This command outputs the content of the file at the given commit.
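The `%H/%ci` format produces one `<commit>/<date>` line per commit. A minimal sketch of parsing such output (the SHA and dates below are made up for illustration):

```python
# Hypothetical output of `git log --format=%H/%ci -- DESCRIPTION`;
# the SHAs and dates are invented, not real commits.
log_output = (
    "a3f9c1d2e4b5a6978877665544332211aabbccdd/2015-04-30 10:12:45 +0200\n"
    "0011223344556677889900aabbccddeeff001122/2014-11-03 08:01:02 +0100\n"
)

commits = []
for line in log_output.split("\n"):
    if not line.strip():
        continue  # skip the trailing empty line
    # Split only on the first '/': the sha and the committer date
    sha, date = (part.strip() for part in line.split("/", 1))
    commits.append((sha, date))

print(commits[0])
```

Since `%ci` dates contain no `/`, splitting on the first slash is enough to separate the two fields.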

In [2]:
github = pandas.DataFrame.from_csv('../data/github-repositories-2015-02-17.csv')
repositories = github[['owner', 'repository']]

# Date is 2015-05-04 because we love Star Wars and because we cloned the repositories on that date
FILENAME = '../data/github-raw-{date}.csv'.format(date='2015-05-04')

# Root of the directory where the repositories were collected
GIT_DIR = '/data/github/'

Since we will retrieve a lot of data, we can benefit from IPython's parallel computation tools.

To use this notebook, you need either to configure your IPController or to start a cluster of IPython engines, e.g. with ipcluster start -n 4. See https://ipython.org/ipython-doc/dev/parallel/parallel_process.html for more information.

Recent versions of IPython Notebook can also start a cluster directly from the web interface, under the Clusters tab.


In [3]:
from IPython import parallel
clients = parallel.Client()
clients.block = False # asynchronous computations
print 'Clients:', str(clients.ids)


Clients: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [4]:
def get_data_from((owner, repository)):
    # Move to target directory
    try:
        os.chdir(os.path.join(GIT_DIR, owner, repository))
    except OSError as e: 
        # Happens when the directory does not exist
        return []
    
    data_list = []
    
    # Get commits for DESCRIPTION
    try:
        commits = subprocess.check_output(['git', 'log', '--format=%H/%ci', '--', 'DESCRIPTION'])
    except subprocess.CalledProcessError as e:
        # Should not happen!?
        raise Exception(owner + ' ' + repository + '/ log : ' + e.output)
        
    for commit in [x for x in commits.split('\n') if len(x.strip())!=0]:
        commit_sha, date = map(lambda x: x.strip(), commit.split('/'))
        
        # Get file content
        try:
            content = subprocess.check_output(['git', 'show', '{id}:{path}'.format(id=commit_sha, path='DESCRIPTION')])
        except subprocess.CalledProcessError as e:
            # Could happen when DESCRIPTION was added in this commit. Silently ignore
            continue
        
        try:
            metadata = deb822.Deb822(content.split('\n'))
        except Exception as e: 
            # I don't know which exceptions Deb822 may throw!
            continue # Go further
            
        data = {}
        
        for md in ['Package', 'Version', 'License', 'Imports', 'Suggests', 'Depends']:
            data[md] = metadata.get(md, '')
        
        data['CommitDate'] = date
        data['Owner'] = owner
        data['Repository'] = repository
        data_list.append(data)

    # Return to root directory
    os.chdir(GIT_DIR)
    return data_list
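DESCRIPTION files use Debian-control-style "Field: value" stanzas, which is why python-debian's Deb822 can parse them. As a stand-in illustration using only the standard library, here is a minimal parser with the same spirit (the sample content and the helper name parse_description are invented, not part of the notebook):

```python
# Invented sample DESCRIPTION content, including a continuation line.
sample = """\
Package: mypackage
Version: 0.1.0
License: MIT
Imports: utils,
    stats
"""

def parse_description(text):
    """Toy stand-in for deb822.Deb822: parse 'Field: value' stanzas."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line[:1].isspace():
            # Indented lines continue the previous field's value
            if current is not None:
                fields[current] += ' ' + line.strip()
        elif ':' in line:
            current, _, value = line.partition(':')
            fields[current] = value.strip()
    return fields

metadata = parse_description(sample)
print(metadata['Package'], metadata['Version'])
```

The real Deb822 class handles more of the format (comments, multiline fields, GPG-signed stanzas), so the notebook rightly relies on it rather than a hand-rolled parser.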

In [5]:
data = []

clients[:].execute('import subprocess, os')
clients[:].execute('from debian import deb822')
clients[:]['GIT_DIR'] = GIT_DIR

balanced = clients.load_balanced_view()

items = [(owner, repo) for idx, (owner, repo) in repositories.iterrows()]

print len(items), 'items'
    
res = balanced.map(get_data_from, items, ordered=False, timeout=15)

import time
while not res.ready():
    time.sleep(5)
    print res.progress, ' ', 
    
for result in res.result:
    data.extend(result)


5150 items
331   674   709   738   766   792   827   855   874   889   906   942   969   999   1032   1059   1092   1107   1141   1183   1219   1248   1286   1350   1369   1414   1475   1539   1593   1636   1685   1727   1750   1760   1798   1844   1873   1908   1962   2005   2061   2102   2168   2215   2255   2300   2317   2353   2398   2435   2470   2488   2542   2587   2634   2680   2729   2772   2796   2830   2875   2911   2956   3003   3036   3077   3082   3115   3163   3194   3259   3307   3360   3383   3411   3441   3494   3534   3578   3615   3625   3634   3658   3699   3744   3803   3847   3911   3972   3986   4041   4089   4117   4178   4252   4318   4358   4366   4395   4433   4491   4532   4579   4620   4649   4699   4754   4806   4850   4914   4974   4983   5040   5078   5143   5150  
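The submit-then-poll pattern used above is not specific to IPython's load-balanced view. A similar sketch with the standard library's concurrent.futures, using an invented stub in place of the real get_data_from, looks like this:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def get_data_from_stub(item):
    # Stand-in for the real extraction function; returns one record per repo.
    owner, repo = item
    return [{'Owner': owner, 'Repository': repo}]

items = [('alice', 'pkg1'), ('bob', 'tools')]  # invented (owner, repository) pairs
data = []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(get_data_from_stub, it) for it in items]
    # as_completed yields results in completion order, like ordered=False above
    for fut in as_completed(futures):
        data.extend(fut.result())

print(len(data))
```

The IPython cluster version has the advantage of distributing work across separate engine processes (or machines), which matters here because the task is dominated by subprocess calls to git.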

In [7]:
df = pandas.DataFrame.from_records(data)
df.to_csv(FILENAME, encoding='utf-8')
print len(df), 'items'
print len(df.drop_duplicates(['Package'])), 'packages'
print len(df.drop_duplicates(['Owner', 'Repository'])), 'repositories'
print len(df.drop_duplicates(['Package', 'Version'])), 'pairs (package, version)'


61846 items
5839 packages
5134 repositories
31374 pairs (package, version)
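drop_duplicates(subset) keeps one row per distinct combination of the given columns, so the counts above are counts of unique key tuples. The same logic can be sketched on a plain list of dicts (the records and the helper n_unique are invented for illustration):

```python
# Invented records mimicking the extracted rows.
records = [
    {'Owner': 'alice', 'Repository': 'pkg1',  'Package': 'pkg1', 'Version': '0.1'},
    {'Owner': 'alice', 'Repository': 'pkg1',  'Package': 'pkg1', 'Version': '0.2'},
    {'Owner': 'bob',   'Repository': 'tools', 'Package': 'pkg1', 'Version': '0.1'},
]

def n_unique(rows, keys):
    """Count distinct combinations of `keys`, like len(df.drop_duplicates(keys))."""
    return len({tuple(row[k] for k in keys) for row in rows})

print(n_unique(records, ['Package']))              # unique packages
print(n_unique(records, ['Owner', 'Repository']))  # unique repositories
print(n_unique(records, ['Package', 'Version']))   # unique (package, version) pairs
```

This explains why the notebook reports fewer repositories (5134) than packages (5839): some repositories host several packages over their history, and a package can appear in more than one repository.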

In [ ]:
df